#stochastic vs random | Explore Tumblr posts and blogs

learning-code-ficusoft · 5 months ago

Text

When to use each type of machine learning algorithm

When to Use Each Type of Machine Learning Algorithm Machine learning (ML) algorithms can be broadly categorized into supervised, unsupervised, and reinforcement learning techniques.

Choosing the right algorithm depends on the type of data available and the problem you are trying to solve.

Let’s explore when to use each type of machine learning algorithm. 1. Supervised Learning Supervised learning involves labeled data, meaning the input data has corresponding output labels.

It is used when you have historical data with known outcomes and want to make predictions based on new data.

Use Cases for Supervised Learning Algorithms a. Regression Algorithms (Predicting Continuous Values)

When to use: Predicting a numeric value based on past data.

When the relationship between input features and output is continuous.

Common algorithms:

Linear Regression (e.g., predicting house prices)

Polynomial Regression (e.g., modeling non-linear trends)

Support Vector Regression (SVR) (e.g., stock price prediction)

Example Use Cases:

✔ House price prediction

✔ Sales forecasting

✔ Temperature prediction

b. Classification Algorithms (Categorizing Data) When to use: When the output falls into predefined categories (e.g., spam vs. non-spam). When you need to make decisions based on distinct classes.

Common algorithms: Logistic Regression (e.g., predicting customer churn) Decision Trees & Random Forests (e.g., diagnosing diseases) Support Vector Machines (SVM) (e.g., image classification) Neural Networks (Deep Learning) (e.g., facial recognition)

Example Use Cases:

✔ Email spam detection

✔ Fraud detection in banking

✔ Sentiment analysis of customer reviews

2. Unsupervised Learning Unsupervised learning is used when you have unlabeled data and need to find hidden patterns or structure within it.

Use Cases for Unsupervised Learning Algorithms

a. Clustering Algorithms (Grouping Similar Data) When to use:

When you need to segment or group data based on similarities.

When you don’t have predefined categories.

Common algorithms:

K-Means Clustering (e.g., customer segmentation)

Hierarchical Clustering (e.g., grouping genetic data)

DBSCAN (e.g., anomaly detection in networks)

Example Use Cases:

✔ Customer segmentation for marketing

✔ Anomaly detection in cybersecurity

✔ Identifying patterns in medical images b. Dimensionality Reduction (Feature Selection & Compression)

When to use: When you have high-dimensional data that needs simplification.

To improve model performance by reducing unnecessary features.

Common algorithms:

Principal Component Analysis (PCA) (e.g., image compression)

t-SNE (t-Distributed Stochastic Neighbor Embedding) (e.g., visualizing high-dimensional data)

Example

Use Cases:

✔ Reducing noise in data for better ML performance

✔ Visualizing complex datasets

✔ Improving computational efficiency in AI models

3. Reinforcement Learning Reinforcement learning (RL) is used when an agent learns by interacting with an environment and receiving rewards or penalties based on its actions.

Use Cases for Reinforcement Learning Algorithms

a. Decision-Making & Strategy Optimization When to use:

When the problem involves sequential decision-making.

When an AI system needs to learn through trial and error.

Common algorithms:

Q-Learning (e.g., robotics and game playing)

Deep Q Networks (DQN) (e.g., self-driving cars)

Proximal Policy Optimization (PPO) (e.g., automated trading)

Example Use Cases:

✔ Self-driving cars learning to navigate

✔ AI playing games (e.g., AlphaGo)

✔ Optimizing dynamic pricing strategies

How to Choose the Right Algorithm?

Problem Type Best Algorithm Type Example Use Case Predict a continuous value Regression (Linear, Polynomial, SVR)

House price prediction Categorize data Classification (Logistic, Decision Tree, SVM, Neural Networks)

Spam detection Find hidden patterns Clustering (K-Means, DBSCAN, Hierarchical)

Customer segmentation Reduce dataset complexity Dimensionality Reduction (PCA, t-SNE)

Feature selection in big data Optimize sequential decisions Reinforcement Learning (Q-Learning, PPO) Self-driving cars

Conclusion

Choosing the right machine learning algorithm depends on: The type of data you have (labeled vs. unlabeled) The problem you’re solving (prediction, classification, clustering, etc.)

The complexity and size of the dataset The need for interpretability and computational efficiency Understanding these factors will help you make informed decisions when selecting ML algorithms for real-world applications.

WEBSITE: https://www.ficusoft.in/deep-learning-training-in-chennai/

0 notes

skippydiesposting · 3 years ago

Text

something I think is really interesting when analyzing pieces of media made by creators who are very thorough and precise and in-depth with their themes and foreshadowing and underlying message are the decisions they make about what not to include.

for instance in Skippy Dies you have Ruprecht as a pretty stereotypical nerdy genius kid. we all know that chess is one of the most pervasive and classic tropes used to shorthand a character as booksmart. and there are definitely opportunities to throw that in, he has plenty of other classical "smart kid" interests, we even know that the school has a chess club that he's a part of. but Ruprecht never mentions it. instead, the game that is mentioned multiple times in association with him is Yahtzee. and I think that's really brilliant because although chess could be used as that kind of genius shorthand the way a lot of other things do, it also has very strong thematic motifs. things like war, hierarchy, good vs evil, sacrifice, et cetera. (some of which are in some way thematically represented in other parts of Skippy Dies, because most things that exist are, but I digress.) and those don't really have a lot to do with the rest of Ruprecht's arc and personality and themes. instead we have Yahtzee, which is a game that has very little of a strategical component and is a lot more about random chance and stochasticity. this ties in a lot better with a lot of the other things Ruprecht is fixated on and later struggles with: randomness in the universe, being unable to control an outcome, repetition, et cetera. it makes more sense thematically.

similarly with Nope, a movie that is in large part about the relationship between humans and animals, I think it's interesting to note that there is no mention of dogs. there is, of course, a ton of things to say about the relationship between the human species and dogs, and from an outsider's perspective you might expect that a movie with those themes would have something to say about that: they're the animal that most people have had the most interactions with and they're a huge part of our society. but Nope is focusing specifically on the ways that animals are exploited by people and misunderstood by humans and anthropomorphized in a way that dogs just don't represent thematically. dogs are domesticated and basically co-evolved with humans, and although I do think it's true that Nope should make you think about the relationships and levels of respect you have with all animals in your life, including dogs, it would have weakened the theme that's it's trying to portray of don't fuck with animals you don't understand, and I think that's brilliant. it just really shows an attention to detail and a commitment to message that I really appreciate and love to see.

like I just think that sticking very close to your theme and making all parts relevant and meaningful to the message you're trying to convey is super important, and I think that when the depth and the proper care and attention is given to that message--instead of just shoving every sort-of related thing you could possibly reference into a box--that's what makes really great art.

#idk just something ive been thinking about i just love when things are complex #i would love to hear if anyone has other examples they can think of 👀#skippy dies #paul murray #nope #nope 2022 #jordan peele #analysis

40 notes · View notes

saintcolumbiformes · 5 years ago

Note

☕ punctuated equilibrium vs. gradualism

I haven’t taken a whole class dedicated specifically to evolutionary biology, so my opinion is still forming lol. I do think that the truth lies within a combination of these theories, in line with other opinions I have on biology. I think punctuated equilibrium is valuable when talking about large stochastic events that drive a lot of random change, or periods of time in the past that saw explosions of diversity. But in general, at least in my opinion and experience, change is pretty slow and, well, gradual. However, the whole picture of evolution wouldn’t be complete without both of them being true in some circumstances.

#asks #monstrousgourmandizingcats #bio

2 notes · View notes

welcometomy20s · 2 years ago

Text

April 29, 2023

There are three modes of morality - One can be mistaken, improper, or stochastic.

Truth exists between two realms - the realm of experience and the realm of logic. We take in experiences and we stitch them together in order to form a worldview. Sometimes we don’t have all of the necessary experiences/information or our strands of reasoning might become flawed. One can be mistaken, and that is true of a lot of bad arguments.

To borrow Ian Danskin, Lady Eboshi is wrong but she is not a bad person. She is wrong, in that she has not considered all of the relevant factors or simply dismissed some variables that turn out to be crucial, but she isn’t evil… she isn’t intending to hurt… nature is merely collateral for industry in Eboshi’s eyes… and in that, she is merely mistaken in her pursuit.

But being mistaken doesn’t absolve culpability, it is how you respond to those mistakes that is the most important part. Science is all about accessing and accommodating new information. When a new line of thought is implemented, scientist’s duty is to incorporate this new information. One must not tell what is wrong with the old idea but how we were misled, perhaps the old way is a simplification of this new idea, or there is some factor that was missing…

When one reacts to ‘undo’ the new information in favor of their prior belief is when we must change tactics, even though they were still mistaken as before.

One might be just but that justice can be doled in an improper fashion. The action might have the right motive, to punish wrongdoers, to restore injustice, but the action has some unintended consequence, both positive and negative, that complicates this matter. The action, then, might be questioned on its impropriety. Unfortunately accessing improperness is undecidable - logically, the mistake vs. improper is equivalent assessment of truth values vs. providing a correct proof. First can always be done, but the second might not be.

Finally there is the self-correcting/stochastic behaviors. Sometimes the causes are so blurred as to be equivalent to random. We can call these Acts of God. Many hierarchical societies will create social structure to induce certain behavior to be stochastic. Increasing gun supply will induce more Meursault killings, or killings where the motive is banal or almost unrelated. By enforcing a high moral standard, the hierarchies can slowly excuse these actions as God-like, as part of nature, even though the structure has induced such behaviors.

Restricting this background variable will be met with various logical arguments - a holistic reductive argument where the focus can be pinned on individuals when the problem can only be viewed through a social lens. This trick maintains the social chaos that these hierarchies desire.

Why do these hierarchies desire such harsh material conditions? Hierarchies are reactions to certain kinds of material conditions. The person is sedentary and cannot exit. There is a large surplus which can be used as the seeds of inequality. And the skills are segregated, creating a cross-section of informational asymmetry. More stochastic violence induces a block in movement, a large influx of surplus and large swaths of informational asymmetry.

#morality #philosophy #Thoughts

0 notes

kerlonlol · 3 years ago

Text

Batch gradient descent

#Batch gradient descent update

UpGrad provides a PG Diploma in Machine Learning and AI and a Master of Science in Machine Learning & AI that may guide you toward building a career. One important factor to keep in mind is choosing the right learning rate for your gradient descent algorithm for optimal prediction. You get to know the role of gradient descent in optimizing a machine learning algorithm. Less passes: Usually, the stochastic gradient descent algorithm doesn’t need more than 10 passes to find the best coefficients.Ĭheck out: 25 Machine Learning Interview Questions & Answers Wrapping up.Rescale inputs: The gradient descent algorithm will minimize the cost function faster if all the input variables are rescaled to the same range, such as or.You need to try and see which value works best for you. Learning rate: The learning rate is very low and is often selected as 0.01 or 0.001.If you see the cost to remain unchanged, try updating the learning rate. Map cost versus time: Plotting the cost with respect to time helps you visualize whether the cost is decreasing or not after each iteration.Machine Learning with R: Everything You Need to Knowīest Practices for Gradient Descent Algorithm Top 7 Trends in Artificial Intelligence & Machine Learning Permutation vs Combination: Difference between Permutation and Combination Robotics Engineer Salary in India : All RolesĪ Day in the Life of a Machine Learning Engineer: What do they do? In stochastic gradient descent, the coefficients are updated for each training instance and not at the end of the batch of instances. You can use the stochastic gradient descent in these conditions where the dataset is huge. In these cases, batch gradient descent will take a long time to compute as one iteration needs a prediction for each instance in the training set. In some cases, the training set can be very large. One batch is referred to as one iteration of the algorithm, and this form is known as batch gradient descent. The cost function is computed over the entire training dataset for every iteration. Source Variants of Gradient Descent Algorithm Batch Gradient Descentīatch gradient descent is one of the most used variants of the gradient descent algorithm. On the contrary, a very low learning rate can help you reach the global minima, but the convergence is very slow, taking many iterations. Selecting a very high learning rate can overshoot the global minima. The selection of the learning rate is important. This process is repeated until the cost function becomes 0 or very close to 0.You need to make sure that this learning rate is not too high nor too low.Ĭoefficient = coefficient – (alpha * del) A learning rate (alpha) can be selected to control how much these coefficients will change in each iteration.

#Batch gradient descent update

After knowing the direction of downhill from the slope, you update the coefficient values accordingly.

The direction should be such that you get a lower cost(error) in the next iteration. Calculating the slope will help you to figure out the direction to move the coefficient values.

We know from the concept of calculus that the derivative of a function is the slope of the function.

The cost function is calculated by putting this value of the coefficient in the function.

The calculation of gradient descent begins with the initial values of coefficients for the function being set as 0 or a small random value. Read: Boosting in Machine Learning: What is, Functions, Types & Features Gradient Descent Algorithm- Methodology However, these bottoms may not be the lowest points and are known as local minima. Depending on the start position of the ball, it may rest on many bottoms of the valley. You want the ball to reach the bottom of the valley, where the bottom of the valley represents the least cost function. The valley is the plot for the cost function here. You can imagine gradient descent as a ball rolling down a valley. This step is repeated until the best coefficients are found. Different values are used as the coefficients to calculate the cost function. The bottom of the bowl is the best coefficient for which the cost function is minimum. This bowl is the plot for the cost function. Suppose you have a large bowl similar to something you’ve your fruit in. Source The intuition behind the Gradient Descent algorithm The point at which cost function is minimum is known as global minima. Gradient descent is one such optimization algorithm used to find the coefficients of a function to reduce the cost function. To achieve this goal, you need to find the required parameters during the training of your model. The goal is to reduce the cost function so that the model is accurate.

#Batch gradient descent

1 note · View note

moreyouread · 4 years ago

Text

Hands on Machine Learning

Chapter 1-2

- batch vs online learning

- instance vs model learning

- hyperparameter grid search

Chapter 3

- 1-precision (x) vs recall (y) is the ROC curve

- true positive rate = recall = sensitivity and true negative rate is = precision = specificity

- harmonic mean to balance precision and recall averages

Chapter 4

- training data with 3 different types of stochastic gradient descent: batch, mini batch, stochastic (with single sample row)

- cross entropy error is minimized for logistic regression

- softmax for multi class predictions. multi-label vs multi-class predictions where labels are mutually exclusive. Softmax is used when mutually exclusive labels.

- softmax helps the gradient not die, while argmax will make it disappear

Chapter 5 SVM

- svm regression is opposite of few points in the street but actually more

- hard vs soft margin classification (like output is a probability vs 1 or 0?)

- kernel trick makes non-linear classification less computationally complex

- dual problem is a problem with a similar or in this case the same mathematical solution as the primal problem of maximizing the distance between the boundaries

- things to better understand: kernel SVM thru Mercer’s condition, how hinge loss applies to SVM solved with gradient descent

Chapter 6

- trees are prone to overfit and regressions are sensitive to the orientation of the data (can be fixed with PCA)

Chapter 7

- ensemble through bagging or pasting: one with replacement and the other without, leading to OOB error

- extra randomized trees when splits on nodes for the tree is done on a random threshold. It’s called random trees bc of using only a subset of features and data points for each tree

- Adaboost (weighting wrong predictions more) vs. gradient boost (adding predictions on all the error residuals)

- stacking is a separate model used to aggregate multiple models instead of a hard vote

Chapter 9 unsupervised

- Silhouette score, balance intra and inter cluster scores, but can do for each cluster to get you a balance within the clusters

- DBSCAN density clustering, sihoulette score to find the optimal epsilon, working well for dense clusters. Don’t need to specify number of clusters

- Gaussian Mixture Model, also density clustering working well for ellipsoid clusters. Do need to specify cluster number, and covariance type of the types of shapes, which would mess it up. It also helps with anomaly detection because of p values. This can’t use silhouette score bc they’re not spherical shapes because of biases of distances.

- Bayesian GMM, similar to lasso for GMM, to set cluster count for you with priors

- Latent class, which is the cluster label of a latent variable

Chapter 13 CNN computer vision

- CNN uses a square to go over pixels in a square, some with zero padding; this is called “convolving”

- the layers are actual horizontal and vertical filters, that the model uses to multiple against inputted image

- these filters can be trained to eventually become pattern detectors. Patterns could be dog faces or even edges

- a pooling layer doesn’t detect patterns but simply averages things together, simplifying complex images

- QUESTION: how does the pattern eventually detect if yes or no for training if something is a dog for instance?

Chapter 8 Dimensionality Reduction

- PCA: projection onto a hyperplane in a dimension, max with the same number of features. The number of top dimensions you pick is your hyper parameter, with the max being the dimensions you are in. The next line is orthogonal for projection

- Kernel PCA: vector is curved or circular, not just 1 straight line. The additional hyper parameter is the shape of the curved lines used. It’s a mathematical transformation used to make different data points linearly separable in a higher dimension (making lines in a lower dimension look curved) without actually having to go to the higher dimension.

- you can decompress by multiplying by the inverse transformation. Then you see how off you are from the actual image, i.e reconstruction error

- another measurement is explained variance ratio for each dimension n, also chosen with an elbow plot

- manifold learning is twisting, unfolding, etc from a 2D space to 3D space

Chapter 14

- RNN predict time series and NLP

- it is a loop with time, each previous layer feeding into the next

- can be shorted with probabilistic dropout and feeding older t-20 to t-1 outputs, to prevent vanishing gradient

- LTSM cell allows you to recognize what’s an important input vs an unimportant one to forget

- encoder vs decoder for machine translation NLP occurs such that encoders are fed in a series as one output to a series of decoders, each with its own output. https://youtu.be/jCrgzJlxTKg

youtube

Chapter 15 autoencoders

a neural function that encodes and decodes, predicting itself (technically unsupervised but is a supervised training neural network with fewer outputs in the middle ie the encoder which simplifies and then the same number of outputs as inputs in the final layer.

GANS used autoencoders to build additional data, and autoencoders are dimensionality reducers.

Questions: how is it reducing dimensionality if the same number of outputs as inputs exist?

It’s helpful for detecting anomalies or even predicting if something is a different class. If the error bar of the output and input is super large, it is likely an anomaly or different class.

https://youtu.be/H1AllrJ-_30

https://youtu.be/yz6dNf7X7SA

Reinforcement learning

Q-learning is a value derived to punish or reward behaviors at each step in reinforcement learning

Reinforcement learning requires doing a lot of steps and getting just 1 success criteria at the end

It can be trained with stochastic gradient descent, boosting the actions with gradient descent that yielded more positive end Q score results

youtube

QUESTIONS

- does waiting longer days increase power? Or does it increase only in so far that sample size increases with more days of new users exposed? More days of data even with the same sample size will decrease std.

#Youtube

1 note · View note

siva3155 · 5 years ago

Text

300+ TOP Deep Learning Interview Questions and Answers

Deep Learning Interview Questions for freshers experienced :-

1. What is Deep Learning? Deep learning is one part of a broader group of machine learning techniques based on learning data analytics designs, as exposed through task-specific algorithms. Deep Learning can be supervised us a semi-supervised or unsupervised. 2. Which data visualization libraries do you use and why they are useful? It is valuable to determine your views value on the data value properly visualization and your individual preferences when one comes to tools. Popular methods add R’s ggplot, Python’s seaborn including matplotlib value, and media such as Plot.ly and Tableau. 3. Where do you regularly source data-sets? This type of questions remains any real tie-breakers. If someone exists going into an interview, he/she need to remember this drill of any related question. That completely explains your interest in Machine Learning. 4. What is the cost function? A cost function is a strength of the efficiency of the neural network data-set value with respect to given sample value and expected output data-set. It is a single value of data-set-function, non-vector as it gives the appearance of the neural network as a whole. MSE=1nΣi=0n(Y^i–Yi)^2 5. What are the benefits of mini-batch gradient descent? This is more efficient of compared tools to stochastic gradient reduction. The generalization data value by determining the flat minima. The Mini-batches provides help to approximate the gradient of this entire data-set advantage which helps us to neglect local minima. 6. What is mean by gradient descent? Gradient descent defined as an essential optimization algorithm value point, which is managed to get the value of parameters that reduces the cost function. It is an iterative algorithm data value function which is moves towards the direction of steepest data value function relationship as described by the form of the gradient. Θ: =Θ–αd∂ΘJ(Θ) 7. What is meant by a backpropagation? It ‘s Forward to the propagation of data-set value function in order to display the output data value function. Then using objective value also output value error derivative package is computed including respect to output activation. Then we after propagate to computing derivative of the error with regard to output activation value function and the previous and continue data value function this for all the hidden layers. Using previously calculated the data-set value and its derivatives the for output including any hidden stories we estimate error derivatives including respect to weights. 8. What is means by convex hull? The convex hull is represents to the outer boundaries of the two-level group of the data point. Once is the convex hull has to been created the data-set value, we get maximum data-set value level of margin hyperplane (MMH), which attempts to create data set value the greatest departure between two groups data set value, as a vertical bisector between two convex hulls data set value. 9. Do you have experience including Spark about big data tools for machine learning? The Spark and big data mean most favorite demand now, able to the handle high-level data-sets value and including speed. Be true if you don’t should experience including those tools needed, but more take a look into assignment descriptions also understand methods pop. 10. How will do handle the missing data? One can find out the missing data and then a data-set value either drop thorugh those rows value or columns value or decide value to restore them with another value. In python library using towards the Pandas, there are two thinging useful functions helpful, IsNull() and drop() the value function.

Deep Learning Interview Questions 11. What is means by auto-encoder? An Auto-encoder does an autonomous Machine learning algorithm data that uses backpropagation system, where that target large values are data-set to be similar to the inputs provided data-set value. Internally, it converts a deep layer that describes a code used to represent specific input. 12. Explain about from Machine Learning in industry. Robots are replacing individuals in various areas. It is because robots are added so that all can perform this task based on the data-set value function they find from sensors. They see from this data also behaves intelligently. 13. What are the difference Algorithm techniques in Machine Learning? Reinforcement Learning Supervised Learning Unsupervised Learning Semi-supervised Learning Transduction Learning to Learn 14. Difference between supervised and unsupervised machine learning? Supervised learning is a method anywhere that requires instruction defined data While Unsupervised learning it doesn’t need data labeling. 15. What is the advantage of Naive Bayes? The classifier preference converge active than discriminative types It cannot learn that exchanges between characteristics 16. What are the function using Supervised Learning? Classifications Speech recognition Regression Predict time series Annotate strings 17. What are the functions using Unsupervised Learning? To Find that the data of the cluster of the data To Find the low-dimensional representations value of the data To Find determine interesting with directions in data To Find the Magnetic coordinates including correlations To Find novel observations 18. How do you understanding Machine Learning Concepts? Machine learning is the use of artificial intelligence that provides operations that ability to automatically detect further improve from occurrence without doing explicitly entered. Machine learning centers on the evolution of network programs that can access data and utilize it to learn for themselves. 19. What are the roles of activation function? The activation function means related to data enter non-linearity within the neural network helping it to learn more system function. Without which that neural network data value would be simply able to get a linear function which is a direct organization of its input data. 20. Definition of Boltzmann Machine? Boltzmann Machine is used to optimize the resolution of a problem. The work of the Boltzmann machine is essential to optimize data-set value that weights and the quantity for data Value. It uses a recurrent structure data value. If we apply affected annealing on discrete Hopfield network, when it would display Boltzmann Machine. Get Deep Learning 100% Practical Training 21. What is Overfitting in Machine Learning? Overfitting in Machine Learning is described as during a statistical data model represents random value error or noise preferably of any underlying relationship or when a pattern is extremely complex. 22. How can you avoid overfitting? Lots of data Cross-validation 23. What are the conditions when Overfitting happens? One of the important design and chance of overfitting is because the models used as training that model is the same as that criterion used to assess the efficacy of a model. 24. What are the advantages of decision trees? The Decision trees are easy to interpret Nonparametric There are comparatively few parameters to tune 25. What are the three stages to build the hypotheses or model in machine learning? Model building Model testing Applying the model 26. What are parametric models and Non-Parametric models? Parametric models remain these with a limited number from parameters also to predict new data, you only need to understand that parameters from the model. Non Parametric designs are those with an unlimited number from parameters, allowing to and flexibility and to predict new data, you want to understand the parameters of this model also the state from the data that has been observed. 27. What are some different cases uses of machine learning algorithms can be used? Fraud Detection Face detection Natural language processing Market Segmentation Text Categorization Bioinformatics 28. What are the popular algorithms for Machine Learning? Decision Trees Probabilistic networks Nearest Neighbor Support vector machines Neural Networks 29. Define univariate multivariate and bivariate analysis? if an analysis involves only one variable it is called as a univariate analysis for eg: Pie chart, Histogram etc. If a analysis involves 2 variables it is called as bivariate analysis for example to see how age vs population is varying we can plot a scatter plot. A multivariate analysis involves more than two variables, for example in regression analysis we see the effect of variables on the response variable 30. How does missing value imputation lead to selection bias? Case treatment- Deleting the entire row for one missing value in a specific column, Implutaion by mean: distribution might get biased for instance std dev, regression, correlation. 31. What is bootstrap sampling? create resampled data from empirical data known as bootstrap replicates. 32. What is permutation sampling? Also known as randomization tests, the process of testing a statistic based on reshuffling the data labels to see the difference between two samples. 33. What is total sum of squares? summation of squares of difference of individual points from the population mean. 34. What is sum of squares within? summation of squares of difference of individual points from the group mean. 35. What is sum of squares between? summation of squares of difference of individual group means from the population mean for each data point. 36. What is p value? p value is the worst case probability of a statistic under the assumption of null hypothesis being true. 37. What is R^2 value? It’s measures the goodness of fit for a linear regression model. 38. What does it mean to have a high R^2 value? the statistic measures variance percentage in dependent variable that can be explained by the independent variables together. 40. What are residuals in a regression model? Residuals in a regression model is the difference between the actual observation and its distance from the predicted value from a regression model. 41. What are fitted values, calculate fitted value for Y=7X+8, when X =5? Response of the model when predictors values are used in the model, Ans=42. 42. What pattern should residual vs fitted plots show in a regression analysis? No pattern, if the plot shows a pattern regression coefficients cannot be trusted. 43. What is overfitting and underfitting? overfitting occurs when a model is excessively complex and cannot generalize well, a overfitted model has a poor predictive performance. Underfitting of a model occurs when the model is not able to capture any trends from the data. 44. Define precision and recall? Recall = True Positives/(True Positives + False Negatives), Precision = True Positives/(True Positives + False Positive). 45. What is type 1 and type 2 errors? False positives are termed as Type 1 error, False negative are termed as Type 2 error. 46. What is ensemble learning? The art of combining multiple learning algorithms and achieve a model with a higher predictive power, for example bagging, boosting. 47. What is the difference between supervised and unsupervised machine learning algorithms? In supervised learning we use the dataset which is labelled and try and learn from that data, unsupervised modeling involves data which is not labelled. 48. What is named entity recognition? It is identifying, understanding textual data to answer certain question like “who, when,where,What etc.” 49. What is tf-idf? It is the measure if a weight of a term in text data used majorly in text mining. It signifies how important a word is to a document. tf -> term frequency – (Count of text appearing in the data) idf -> inverse document frequency tfidf -> tf * idf 50. What is the difference between regression and deep neural networks, is regression better than neural networks? In some applications neural networks would fit better than regression it usually happens when there are non linearity involved, on the contrary a linear regression model would have less parameters to estimate than a neural network for the same set of input variables. thus for optimization neural network would need a more data in order to get better generalization and nonlinear association. 51. How are node values calculated in a feed forward neural network? The weights are multiplied with node/input values and are summed up to generate the next successive node 52. Name two activation functions used in deep neural networks? Sigmoid, softmax, relu, leaky relu, tanh. 53. What is the use of activation functions in neural networks? Activation functions are used to explain the non linearity present in the data. 54. How are the weights calculated which determine interactions in neural networks? The training model sets weights to optimize predictive accuracy. 55. which layer in a deep learning model would capture a more complex or higher order interaction? The last layer. 56. What is gradient descent? It comprises of minimizing a loss function to find the optimal weights for a neural network. 57. Imagine a loss function vs weights plot depicting a gradient descent. At What point of the curve would we achieve optimal weights? local minima. 58. How does slope of tangent to the curve of loss function vs weigts help us in getting optimal weights for a neural network Slope of a curve at any point will give us the direction component which would help us decide which direction we would want to go i.e What weights to consider to achieve a less magnitude for loss function. 59. What is learning rate in gradient descent? A value depicting how slowly we should move towards achieving optimal weights, weights are changedby the subtracting the value obtained from the product of learning rate and slope. 60. If in backward propagation you have gone through 9 iterations of calculating slopes and updated the weights simultaneously, how many times you must have done forward propagation? 9 61. How does ReLU activation function works? Define its value for -5 and +7 For all x>=0, the output is x, for all x Read the full article

0 notes

project1ixdmaster2019 · 6 years ago

Text

15th September

Todays group meeting was all about preparing and structuring our findings. As the next supervision meeting is tomorrow we used the time today to map out our findings, see if we can find similarities, groups and factors that are referring to different key aspects we have collected the past days.

Our homework over the weekend was to gather research paper, browse what is the current state of the art when it comes to sonic experiences.

First thing in the morning was to present to each other what we have found out. Interesting insights about designers sonic spaces in connection to architecture, how body percussion contributes to feeling happy and less stressed, to the difference between aural architecture and acoustic architecture and a better understanding of what stochastic means. After updating each other we tried to map out and understand how the research findings can be connected to our previous fieldwork and general interesting key insights we have gathered before.

The first mapping created our “Hybrid Hall” mapping which shows all of our important fieldwork findings on different post its. We were able to cluster our findings into three sub-groups related to: lifestyle, study environment and general findings.

After that we combined our research findings with the research information we gathered and grouped them around interesting key insights we as a group can see ourselves designing for.

This map shows research findings (white post its), our research findings (green post its) and important key insights (pink post its) that we would love to explore further in this context.

On the right side we placed possible concerns that we would like to address tomorrow such as in what way our design project has to be stochastic as a lot of research that we have found claim that if there is a design concept / determination the definition of stochastic is not applicable anymore. So our question is whether how or in which relation to stochastic we have to design.

On the left side there are general research findings that we were not able to place into our context so far.

We proceeded to prepare the supervision invite for tomorrow:

Summary:

After focusing on the hall in general during the first phase of fieldwork, we have been analyzing more specific places and situations during the second phase (see Fig.1). We observed break behaviour in the hallways in front of the lecture halls and the toilets as a specific place. Both can provoke sonic tensions.

Observations:

Behaviour and activities of people in the hall and at study spaces

Which groups of people do occur

Activities on different floors as well as main differences

Decibel measuring of existing sound level

Break behaviour and activities of students

Focus on “black islands” in the hallway around lecture halls

Interviews:

Students studying on their own or doing group work, mainly on the workplaces

External (individual) persons in the hall

About break behaviour and activities of students

People waiting in front of the toilet

Desktop Research:

Intentions of the architects of the Niagara building

The sound of architecture, acoustics, sonic architecture

Acoustic vs. aural (physical vs. experimental)

Stochastic and stochastic sounds

How to create randomness?

Body percussion and physical activities

How to make visual architecture auditive

Article about the toilet’s embarrassment

Fieldwork still missing:

Expert interviews with university staff responsible for the interior and furniture of Niagara

We are already in contact with them and the janitor

Maybe contacting the architects?

Toilet polls?

Main Insights:

Break activities

Body percussion releases natural endorphins, therefore is a way to release stress.

People like to stand to the rail and observe others downstairs as a meditating activity

A lot of people stay in the classroom and don’t move at all

Some of them decide to walk around the hall

Instead of going to the rooftop, they go downstairs to smoke

Many of the students gather around toilet spaces

Toilet behaviour:

It could be embarrassing to have the feeling of being heard while in the toilets (paper: fecal matters) because of the way the toilets are placed on the floors (people are having a break in the corridor close to the toilets).

The ambiguity of the building’s architecture regarding connectivity and isolation

This space somehow connects people (open space, people see and watch each other) but the students finally isolate each other because they need to study and to have a calm place.

The architect intended this connection with his design

Connection of three different buildings and sectors (ABC)

Want to concentrate on their studies and therefore isolate themselves with headphones to their own soundscape.

Isolation through using smart furniture, that muffle the surrounding sound

Hive feeling

might be seen as a random chaos but people in niagara follow patterns and have a destination as a goal

moving patterns (elevators vs stairs)

Challenges:

How can we make people taking the stairs instead of the crowded elevator?

Is there a way to connect the isolated (headphone using) people with each other?

Can we make people to do more (physical) activities during their break to destress them / change their mind?

How can we connect people to each other?

How can existing moving patterns can be represented in stochastic sound?

How can we create a sonic personality/universe of the building?

How can we represent the visual architecture of the building in an auditive way?

Concerns:

Conflict Designer/Stochastic: How to create randomness - as soon as it is created, there is an intention behind it, so it is not random anymore? Do we create something stochastic or do we create something based on a stochastic input?

Should we collect/look for more case studies or similar projects?

0 notes

wonbindatascience · 6 years ago

Text

Weight Initialization

1. Intro

Optimization? : The search process is incremental from a starting point in the space of possible solutions toward some good enough solution.

Stochastic algorithm? : Search problems are often very challenging and require the use of nondeterministic algorithms that make heavy use of randomness. (The algorithms are not random themselves; instead they make careful use of randomness. They are random within a bound)

Deterministic vs Non-Deterministic : Some problems are hard to solve deterministically because of the computational expensiveness. So an alternate solution is to use nondeterministic algorithms. These are algorithms that use elements of randomness when making decisions during the execution of the algorithm. This means that a different order of steps will be followed when the same algorithm is rerun on the same data. They can rapidly speed up the process of getting a solution, but the solution will be approximate, or “good,” but often not the “best.”

(https://en.wikipedia.org/wiki/Nondeterministic_algorithm)

So? : We know nothing about the structure of the search space. Therefore, to remove bias from the search process, we start from a randomly chosen position. As the search process unfolds, there is a risk that we are stuck in an unfavorable area of the search space (’local optima’). Using randomness during the search process gives some likelihood of getting unstuck and finding a better final candidate solution.

2. In NN

: Artificial neural networks are trained using a stochastic optimization algorithm called stochastic gradient descent. The algorithm uses randomness in order to find a good enough set of weights for the specific mapping function. It means that your specific network on your specific training data will fit a different network with a different model skill each time the training algorithm is run.

2.1 What

: Stochastic optimization algorithms such as stochastic gradient descent use randomness

in selecting a starting point for the search and

in the progression of the search.

2.2 How***

Specifically, stochastic gradient descent requires that the weights of the network are initialized to small random values (random, but close to zero, such as in [0.0, 0.1]).

Randomness is also used during the search process in the shuffling of the training dataset prior to each epoch, which in turn results in differences in the gradient estimate for each batch.

3. In practice

Don’t Set Weights to Zero

When to Initialize to the Same Weights? : only when the model’s used in production environment in real world. But this would not be helpful when evaluating network configurations.

There is no single best way to initialize the weights of a neural network. It is one more hyperparameter for you to explore and test and experiment with on your specific predictive modeling problem.

(https://machinelearningmastery.com/why-initialize-a-neural-network-with-random-weights/)

0 notes

freeninjatraderindicatorsbl-blog · 6 years ago

Text

Trading Price Action Vs. Indicators

Traders all over the globe have been battling over the proper way to day trade. Trading Price action or price action trading as it's called is taking precedence when compared to following Indicators on a chart. What's better? What works? What makes the most profit? Let us compare the two methods of trading and come to a conclusion to which trading style comes out on top. Let us digest the question that most serious trades come to realize. Which method works better, price action or indicators? Price Action as explained and taught by Day Trade To Win is a method of trading where price is used as the primary tool for determining risk and reward in real-time. In terms of online day trading, price action is the price movement displayed on the chart, and nothing more. The price as displayed and plotted on the chart real time can provide traders everything they need. The 5 minute chart seems to be the most popular. Inherently, price action is a singular method of trading, requiring no external trading tools (like indicators), plug-ins, or other third-party software for charting purposes. All that is required is an understanding of the market, and a set of rules for interacting with price and the ability to identify price behavior. Sounds simple enough, right? Well not so fast. The interpretation of the price is what makes or breaks price action trading. Without the proper understanding, education, and specific rules, the chart looks nothing more than random bars. Having the key to decipher the code is what price action is all about. Few traders have passed on this method of trading to others in detail and deciphering this code is not easy. Adversely, indicators are third-party extensions that summarize data, advising traders when and how to trade. Indicators exist for nearly every aspect of day trading for nearly every software platform that supports them. For example, NinjaTrader's indicator list includes indicators ranging from Bollinger Bands to oscillators, moving averages, volume averages, stochastics, and everything in between. Indicators that focus on price action do not exist for the most part, with the exception of those offered at price action exclusive websites. This exception is a stretch, as software like the Atlas Line Indicator is really a price guide; indicating what type of trade to take (long or short) only if price confirms the action. In order for an indicator to be considered compatible with price action, the indicator must: Operate and produce signals in real-time not after the fact. Be compatible with price as it moves on the chart Produce non-conflicting signals that whipsaw a trader Indicators look pretty and have lots of colors. The question is do they really help, or do they create dependency for traders? Let's first understand what an indicator does. A trading indicator needs price to first to make a move up or down. If you cherished this article and you would like to be given more info about ninjatrader 8 footprint chart please visit our webpage. Once this move is made the Indicator takes what just occurred and plots a point - line - bar - graph on the chart. The indicator by definition is already late in providing information to a trader about a move up or down in the market which has already occurred. Indicators also have another huge issue which traders fail to realize. Which parameter is right for the market being traded? What has worked in the past, will most likely not work in the future. If the indicator in question has been optimized with historical data, then how will history relate to the forward looking performance when traded? This becomes the issue at hand. At the moment it seems price action has the advantage with the comparisons made. This is just "Round 1" and the following articles will provide more info, but for now let's understand what each contestant stands for and what each brings to the table. Price action trading: Is free - a trader does not need to extra software. Candles, bars, dots or any other chart price symbol will provide ample information for price action traders. Can be used on any market at any time under any circumstances (E-Mini S only certain markets and / or trading software may be supported. Subject to the law of overuse - the more traders that use an indicator, the more a market will adapt "in retaliation" to its overuse, thus rendering it ineffective. Price action is free from such boundaries as it is based on watching the resulting changes in price. Easy to use, and follow. A no-brainer, nothing to think about and only following the signals is needed. While Price Action trading may be free, it may take a trader quite a while of practice (and a few losses) to determine what works. The logical next step in preventing losses is pursuing a form of day trading education. Indicators are a dime a dozen and most focus on following the heard. Beginner to advanced educational programs are available and some even feature "Private Mentorship", a one-on-one trading from an experienced price action traders. Some program includes exact instructions on scalping methods, filtering trades, trading the news, and much more. Six weeks of live tutoring at the student's own pace is much more effective in creating a self-sufficient day trader than any combination of indicators. Stay tuned for what happens next in "Round 2" of Price Action vs Indicators.

0 notes

misentropy · 5 years ago

Text

deterministic vs. stochastic

We can think of disease patterns as leaning deterministic or stochastic: In the former, an outbreak’s distribution is more linear and predictable; in the latter, randomness plays a much larger role and predictions are hard, if not impossible, to make. In deterministic trajectories, we expect what happened yesterday to give us a good sense of what to expect tomorrow. Stochastic phenomena, however, don’t operate like that—the same inputs don’t always produce the same outputs, and things can tip over quickly from one state to the other. As Scarpino told me, “Diseases like the flu are pretty nearly deterministic and R0 (while flawed) paints about the right picture (nearly impossible to stop until there’s a vaccine).” That’s not necessarily the case with super-spreading diseases.

// Source

#epidemiology #covid19 #statistics #deterministic #stochastic

0 notes

theresawelchy · 6 years ago

Text

Recommendations for Deep Learning Neural Network Practitioners

Deep learning neural networks are relatively straightforward to define and train given the wide adoption of open source libraries.

Nevertheless, neural networks remain challenging to configure and train.

In his 2012 paper titled “Practical Recommendations for Gradient-Based Training of Deep Architectures” published as a preprint and a chapter of the popular 2012 book “Neural Networks: Tricks of the Trade,” Yoshua Bengio, one of the fathers of the field of deep learning, provides practical recommendations for configuring and tuning neural network models.

In this post, you will step through this long and interesting paper and pick out the most relevant tips and tricks for modern deep learning practitioners.

After reading this post, you will know:

The early foundations for the deep learning renaissance including pretraining and autoencoders.

Recommendations for the initial configuration for the range of neural network hyperparameters.

How to effectively tune neural network hyperparameters and tactics to tune models more efficiently.

Let’s get started.

Practical Recommendations for Deep Learning Neural Network Practitioners Photo by Susanne Nilsson, some rights reserved.

Overview

This tutorial is divided into five parts; they are:

Required Reading for Practitioners

Paper Overview

Beginnings of Deep Learning

Learning via Gradient Descent

Hyperparameter Recommendations

Recommendations for Practitioners

In 2012, a second edition of the popular practical book “Neural Networks: Tricks of the Trade” was published.

The first edition was published in 1999 and contained 17 chapters (each written by different academics and experts) on how to get the most out of neural network models. The updated second edition added 13 more chapters, including an important chapter (chapter 19) by Yoshua Bengio titled “Practical Recommendations for Gradient-Based Training of Deep Architectures.”

The time that this second edition was published was an important time in the renewed interest in neural networks and the start of what has become “deep learning.” Yoshua Bengio’s chapter is important because it provides recommendations for developing neural network models, including the details for, at the time, very modern deep learning methods.

Although the chapter can be read as part of the second edition, Bengio also published a preprint of the chapter to the arXiv website, that can be accessed here:

Practical Recommendations for Gradient-Based Training of Deep Architectures, Preprint, 2012.

The chapter is also important as it provides a valuable foundation for what became the de facto textbook on deep learning four years later, titled simply “Deep Learning,” for which Bengio was a co-author.

This chapter (I’ll refer to it as a paper from now on) is required reading for all neural network practitioners.

In this post, we will step through each section of the paper and point out some of the most salient recommendations.

Want Better Results with Deep Learning?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Download Your FREE Mini-Course

Paper Overview

The goal of the paper is to provide practitioners with practical recommendations for developing neural network models.

There are many types of neural network models and many types of practitioners, so the goal is broad and the recommendations are not specific to a given type of neural network or predictive modeling problem. This is good in that we can apply the recommendations liberally on our projects, but also frustrating as specific examples from literature or case studies are not given.

The focus of these recommendations is on the configuration of model hyperparameters, specifically those related to the stochastic gradient descent learning algorithm.

This chapter is meant as a practical guide with recommendations for some of the most commonly used hyper-parameters, in particular in the context of learning algorithms based on backpropagated gradient and gradient-based optimization.

Recommendations are presented in the context of the dawn of the field of deep learning, where modern methods and fast GPU hardware facilitated the development of networks with more depth and, in turn, more capability than had been seen before. Bengio draws this renaissance back to 2006 (six years before the time of writing) and the development of greedy layer-wise pretraining methods, that later (after this paper was written) were replaced by extensive use of ReLU, Dropout, BatchNorm, and other methods that aided in developing very deep models.

The 2006 Deep Learning breakthrough centered on the use of unsupervised learning to help learning internal representations by providing a local training signal at each level of a hierarchy of features.

The paper is divided into six main sections, with section three providing the main reading focus on recommendations for configuring hyperparameters. The full table of contents for the paper is provided below.

Abstract

1 Introduction

1.1 Deep Learning and Greedy Layer-Wise Pretraining

1.2 Denoising and Contractive AutoEncoders

1.3 Online Learning and Optimization of Generalization Error

2 Gradients

2.1 Gradient Descent and Learning Rate

2.2 Gradient Computation and Automatic Differentiation

3 Hyper-Parameters

3.1 Neural Network HyperParameters

3.1.1 Hyper-Parameters of the Approximate Optimization

3.2 Hyper-Parameters of the Model and Training Criterion

3.3 Manual Search and Grid Search

3.3.1 General guidance for the exploration of hyper-parameters

3.3.2 Coordinate Descent and MultiResolution Search

3.3.3 Automated and Semi-automated Grid Search

3.3.4 Layer-wise optimization of hyperparameters

3.4 Random Sampling of HyperParameters

4 Debugging and Analysis

4.1 Gradient Checking and Controlled Overfitting

4.2 Visualizations and Statistics

5 Other Recommendations

5.1 Multi-core machines, BLAS and GPUs

5.2 Sparse High-Dimensional Inputs

5.3 Symbolic Variables, Embeddings, Multi-Task Learning and MultiRelational Learning

6 Open Questions

6.1 On the Added Difficulty of Training Deeper Architectures

6.2 Adaptive Learning Rates and Second-Order Methods

6.3 Conclusion

We will not touch on each section, but instead focus on the beginning of the paper and specifically the recommendations for hyperparameters and model tuning.

Beginnings of Deep Learning

The introduction section spends some time on the beginnings of deep learning, which is fascinating if viewed as a historical snapshot of the field.

At the time, the deep learning renaissance was driven by the development of neural network models with many more layers than could be used previously based on techniques such as greedy layer-wise pretraining and representation learning via autoencoders.

One of the most commonly used approaches for training deep neural networks is based on greedy layer-wise pre-training.

Not only was the approach important because it allowed the development of deeper models, but also the unsupervised form allowed the use of unlabeled examples, e.g. semi-supervised learning, which too was a breakthrough.

Another important motivation for feature learning and Deep Learning is that they can be done with unlabeled examples …

As such, reuse (literal reuse) was a major theme.

The notion of reuse, which explains the power of distributed representations is also at the heart of the theoretical advantages behind Deep Learning.

Although a single or two-layer neural network of sufficient capacity can be shown to approximate any function in theory, he offers a gentle reminder that deep networks provide a computational short-cut to approximating more complex functions. This is an important reminder and helps in motivating the development of deep models.

Theoretical results clearly identify families of functions where a deep representation can be exponentially more efficient than one that is insufficiently deep.

Time is spent stepping through two of the major “deep learning” breakthroughs: greedy layer-wise pretraining (both supervised and unsupervised) and autoencoders (both denoising and contrastive).

The third breakthrough, RBMs were left for discussion in another chapter of the book written by Hinton, the developer of the method.

Restricted Boltzmann Machine (RBM).

Greedy Layer-Wise Pretraining (Unsupervised and Supervised).

Autoencoders (Denoising and Contrastive).

Although milestones, none of these techniques are preferred and used widely today (six years later) in the development of deep learning, and with perhaps with the exception of autoencoders, none are vigorously researched as they once were.

Learning via Gradient Descent

Section two provides a foundation on gradients and gradient learning algorithms, the main optimization technique used to fit neural network weights to training datasets.

This includes the important distinction between batch and stochastic gradient descent, and approximations via mini-batch gradient descent, today all simply referred to as stochastic gradient descent.

Batch Gradient Descent. Gradient is estimated using all examples in the training dataset.

Stochastic (Online) Gradient Descent. Gradient is estimated using subsets of samples in the training dataset.

Mini-Batch Gradient Descent. Gradient is estimated using each single pattern in the training dataset.

The mini-batch variant is offered as a way to achieve the speed of convergence offered by stochastic gradient descent with the improved estimate of the error gradient offered by batch gradient descent.

Larger batch sizes slow down convergence.

On the other hand, as B [the batch size] increases, the number of updates per computation done decreases, which slows down convergence (in terms of error vs number of multiply-add operations performed) because less updates can be done in the same computing time.

Smaller batch sizes offer a regularizing effect due to the introduction of statistical noise in the gradient estimate.

… smaller values of B [the batch size] may benefit from more exploration in parameter space and a form of regularization both due to the “noise” injected in the gradient estimator, which may explain the better test results sometimes observed with smaller B.

This time was also the introduction and wider adoption of automatic differentiation in the development of neural network models.

The gradient can be either computed manually or through automatic differentiation.

This was of particular interest to Bengio given his involvement in the development of the Theano Python mathematical library and pylearn2 deep learning library, both now defunct, succeeded perhaps by TensorFlow and Keras respectively.

Manually implementing differentiation for neural networks is easy to mess up and errors can be hard to debug and cause sub-optimal performance.

When implementing gradient descent algorithms with manual differentiation the result tends to be verbose, brittle code that lacks modularity – all bad things in terms of software engineering.

Automatic differentiation is painted as a more robust approach to developing neural networks as graphs of mathematical operations, each of which knows how to differentiate, which can be defined symbolically.

A better approach is to express the flow graph in terms of objects that modularize how to compute outputs from inputs as well as how to compute the partial derivatives necessary for gradient descent.

The flexibility of the graph-based approach to defining models and the reduced likelihood of error in calculating error derivatives means that this approach has become a standard, at least in the underlying mathematical libraries, for modern open source neural network libraries.

Hyperparameter Recommendations

The main focus of the paper is on the configuration of the hyperparameters that control the convergence and generalization of the model under stochastic gradient descent.

Use a Validation Dataset

The section starts off with the importance of using a separate validation dataset from the train and test sets for tuning model hyperparameters.

For any hyper-parameter that has an impact on the effective capacity of a learner, it makes more sense to select its value based on out-of-sample data (outside the training set), e.g., a validation set performance, online error, or cross-validation error.

And on the importance of not including the validation dataset in the evaluation of the performance of the model.

Once some out-of-sample data has been used for selecting hyper-parameter values, it cannot be used anymore to obtain an unbiased estimator of generalization performance, so one typically uses a test set (or double cross-validation, in the case of small datasets) to estimate generalization error of the pure learning algorithm (with hyper-parameter selection hidden inside).

Cross-validation is often not used with neural network models given that they can take days, weeks, or even months to train. Nevertheless, on smaller datasets where cross-validation can be used, the double cross-validation technique is suggested, where hyperparameter tuning is performed within each cross-validation fold.

Double cross-validation applies recursively the idea of cross-validation, using an outer loop cross-validation to evaluate generalization error and then applying an inner loop cross-validation inside each outer loop split’s training subset (i.e., splitting it again into training and validation folds) in order to select hyper-parameters for that split.

Learning Hyperparameters

A suite of learning hyperparameters is then introduced, sprinkled with recommendations.

The hyperparameters in the suite are:

Initial Learning Rate. The proportion that weights are updated; 0.01 is a good start.

Learning Sate Schedule. Decrease in learning rate over time; 1/T is a good start.

Mini-batch Size. Number of samples used to estimate the gradient; 32 is a good start.

Training Iterations. Number of updates to the weights; set large and use early stopping.

Momentum. Use history from prior weight updates; set large (e.g. 0.9).

Layer-Specific Hyperparameters. Possible, but rarely done.

The learning rate is presented as the most important parameter to tune. Although a value of 0.01 is a recommended starting point, dialing it in for a specific dataset and model is required.

This is often the single most important hyperparameter and one should always make sure that it has been tuned […] A default value of 0.01 typically works for standard multi-layer neural networks but it would be foolish to rely exclusively on this default value.

He goes so far to say that if only one parameter can be tuned, then it would be the learning rate.

If there is only time to optimize one hyper-parameter and one uses stochastic gradient descent, then this is the hyper-parameter that is worth tuning.

The batch size is presented as a control on the speed of learning, not about tuning test set performance (generalization error).

In theory, this hyper-parameter should impact training time and not so much test performance, so it can be optimized separately of the other hyperparameters, by comparing training curves (training and validation error vs amount of training time), after the other hyper-parameters (except learning rate) have been selected.

Model Hyperparameters

Model hyperparameters are then introduced, again sprinkled with recommendations.

They are:

Number of Nodes. Control over the capacity of the model; use larger models with regularization.

Weight Regularization. Penalize models with large weights; try L2 generally or L1 for sparsity.

Activity Regularization. Penalize model for large activations; try L1 for sparse representations.

Activation Function. Used as the output of nodes in hidden layers; use sigmoidal functions (logistic and tang) or rectifier (now the standard).

Weight Initialization. The starting point for the optimization process; influenced by activation function and size of the prior layer.

Random Seeds. Stochastic nature of optimization process; average models from multiple runs.

Preprocessing. Prepare data prior to modeling; at least standardize and remove correlations.

Configuring the number of nodes in a layer is challenging and perhaps one of the most asked questions by beginners. He suggests that using the same number of nodes in each hidden layer might be a good starting point.

In a large comparative study, we found that using the same size for all layers worked generally better or the same as using a decreasing size (pyramid-like) or increasing size (upside down pyramid), but of course this may be data-dependent.

He also recommends using an overcomplete configuration for the first hidden layer.

For most tasks that we worked on, we find that an overcomplete (larger than the input vector) first hidden layer works better than an undercomplete one.

Given the focus on layer-wise training and autoencoder, the sparsity of the representation (output of hidden layers) was a focus at the time. Hence the recommendation of using activity regularization that may still be useful in larger encoder-decoder models.

Sparse representations may be advantageous because they encourage representations that disentangle the underlying factors of representation.

At the time, the linear rectifier activation function was just beginning to be used and had not widely been adopted. Today, using the rectifier (ReLU) is the standard given that models using it readily out-perform models using logistic or hyperbolic tangent nonlinearities.

Tuning Hyperparameters

The default configurations do well for most neural networks on most problems.

Nevertheless, hyperparameter tuning is required to get the most out of a given model on a given dataset.

Tuning hyperparameters can be challenging both because of the computational resources required and because it can be easy to overfit the validation dataset, resulting in misleading findings.

One has to think of hyperparameter selection as a difficult form of learning: there is both an optimization problem (looking for hyper-parameter configurations that yield low validation error) and a generalization problem: there is uncertainty about the expected generalization after optimizing validation performance, and it is possible to overfit the validation error and get optimistically biased estimators of performance when comparing many hyper-parameter configurations.

Tuning one hyperparameter for a model and plotting the results often results in a U-shaped curve showing the pattern of poor performance, good performance, and back up to poor performance (e.g. minimizing loss or error). The goal is to find the bottom of the “U.”

The problem is, many hyperparameters interact and the bottom of the “U” can be noisy.

Although to first approximation we expect a kind of U-shaped curve (when considering only a single hyper-parameter, the others being fixed), this curve can also have noisy variations, in part due to the use of finite data sets.

To aid in this search, he then provides three valuable tips to consider generally when tuning model hyperparameters:

Best value on the border. Consider expanding the search if a good value is found on the edge of the interval searched.

Scale of values considered. Consider searching on a log scale, at least at first (e.g. 0.1, 0.01, 0.001, etc.).

Computational considerations. Consider giving up fidelity of the result in order to accelerate the search.

Three systematic hyperparameter search strategies are suggested:

Coordinate Descent. Dial-in each hyperparameter one at a time.

Multi-Resolution Search. Iteratively zoom in the search interval.

Grid Search. Define an n-dimensional grid of values and test each in turn.

These strategies can be used separately or even combined.

The grid search is perhaps the most commonly understood and widely used method for tuning model hyperparameters. It is exhaustive, but parallelizable, a benefit that can be exploited using cheap cloud computing infrastructure.

The advantage of the grid search, compared to many other optimization strategies (such as coordinate descent), is that it is fully parallelizable.

Often, the process is repeated via iterative grid searches, combining the multi-resolution and grid search.

Typically, a single grid search is not enough and practitioners tend to proceed with a sequence of grid searches, each time adjusting the ranges of values considered based on the previous results obtained.

He also suggests keeping a human in the loop to keep an eye out for bugs and use pattern recognition to identify trends and change the shape of the search space.

Humans can get very good at performing hyperparameter search, and having a human in the loop also has the advantage that it can help detect bugs or unwanted or unexpected behavior of a learning algorithm.

Nevertheless, it is important to automate as much as possible to ensure the process is repeatable for new problems and models in the future.

The grid search is exhaustive and slow.

A serious problem with the grid search approach to find good hyper-parameter configurations is that it scales exponentially badly with the number of hyperparameters considered.

He suggests using a random sampling strategy, which has been shown to be effective. The interval of each hyperparameter can be searched uniformly. This distribution can be biased by including priors, such as the choice of sensible defaults.

The idea of random sampling is to replace the regular grid by a random (typically uniform) sampling. Each tested hyper-parameter configuration is selected by independently sampling each hyper-parameter from a prior distribution (typically uniform in the log-domain, inside the interval of interest).

The paper ends with more general recommendations, including techniques for debugging the learning process, speeding up training with GPU hardware, and remaining open questions.